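Data features

Name

Whether the animal has a name or not

AnimalType

Cat or dog

SexuponOutcome

Use all values

AgeuponOutcome

Ages of 13 years or more are merged into one category. Use all values

Breed

Use only the breeds that are clearly distinguishable

Color

Use only the colors that are clearly distinguishable

DateTime

Only the hour is useful (year, month, and day show no distinctive pattern)

OutcomeType

Prediction target

OutcomeSubtype

Not needed

AnimalID

Not needed

The Died class of OutcomeType has few samples.

Let's build two separate models: one that also predicts Died and one that excludes Died.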
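
To put a number on how rare Died is, here is a small sketch that uses the per-class test-set support values reported by the classification reports in this notebook. The dictionary is typed in by hand from those reports; the shares describe the 20% hold-out split only, not the full training file.

```python
# Test-set support per class, copied from the classification reports in this notebook.
support = {
    "Adoption": 2219,
    "Died": 33,
    "Euthanasia": 298,
    "Return_to_owner": 961,
    "Transfer": 1835,
}

total = sum(support.values())
share = {label: count / total for label, count in support.items()}

# Print classes from rarest to most common.
for label, fraction in sorted(share.items(), key=lambda kv: kv[1]):
    print(f"{label:<16}{fraction:6.1%}")
```

Died comes out well below 1% of the hold-out rows, which is why a separate model without that class is worth trying.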



In [85]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')



In [86]:

    
# Name encoding: 1 if the animal has a name, 0 otherwise
df['Name'] = df['Name'].notnull().astype(int)



In [87]:

    
# AnimalType encoding: 1 for dog, 0 for cat
df.AnimalType = df.AnimalType.apply(lambda x: 1 if x=='Dog' else 0)



In [88]:

    
def check_over13years(x):
    # Collapse ages such as '13 years' or '15 years' into a single 'over 13 years' bucket
    if len(x)>7:
        if x[-5:] == 'years' and int(x[:-5])>=13:
            return 'over 13 years'
    
    return x



In [89]:

    
# AgeuponOutcome: merge animals aged 13 years or more into a single category
# Rows with a missing age look very similar to '0 years' rows, so treat them as '0 years'
df.AgeuponOutcome = df.AgeuponOutcome.fillna('0 years')
df.AgeuponOutcome = df.AgeuponOutcome.apply(check_over13years)
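
As a quick check of the age rule, the sketch below repeats check_over13years and applies it to a few hand-written age strings. The sample strings are assumptions that follow the 'N unit' format the function expects; they are not rows from train.csv.

```python
def check_over13years(x):
    # Same rule as the cell above: only strings ending in 'years' with a value >= 13 change
    if len(x) > 7:
        if x[-5:] == 'years' and int(x[:-5]) >= 13:
            return 'over 13 years'
    return x

# Hypothetical sample values in the 'N unit' format
samples = ['2 months', '1 year', '12 years', '13 years', '20 years', '0 years']
bucketed = [check_over13years(s) for s in samples]
print(bucketed)
```

Only the two values at or above 13 years change; shorter strings such as '0 years' never reach the numeric comparison because of the length check.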



In [90]:

    
# Drop columns that are not needed
df = df.drop(['AnimalID', 'OutcomeSubtype'], axis=1)



In [91]:

    
# Breed grouping: only the breeds below are used
breeds = ['Labrador Retriever', 
'German Shepherd', 
'Golden Retriever', 
'Beagle', 
'Bulldog', 
'Yorkshire Terrier', 
'Boxer', 
'Poodle', 
'Rottweiler', 
'Siberian Husky', 
'Maltese', 
'Persian',
'Maine Coon',
'Siamese',
'American Shorthair', 
'Swedish Vallhund',
'Finnish', 
'Catahoula', 
'Ridgeback', 
'Carolina', 
'Manx',
'Domestic Shorthair',
'Pit Bull',
'Chihuahua',
'Domestic Medium Hair',
'Domestic Longhair',
'Dachshund',
'Rat Terrier',
'Miniature Schnauzer',
'Cairn Terrier',
'Shih Tzu']



In [92]:

    
def check_in_breeds(x):
    # Return the first listed breed found in the Breed string (mixes included); None otherwise
    for breed in breeds:
        if x.count(breed) > 0:
            return breed



In [93]:

    
# Keep the breed if it is in the list; otherwise set it to None
df.Breed = df.Breed.apply(check_in_breeds)



In [94]:

    
## Keep only colors with at least 50 samples; the rest are treated as "others"
list_color = list(df.Color)
list_color_over50 = []
for color in set(list_color):
    if list_color.count(color) >= 50:
        list_color_over50.append(color)
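
The same frequency filter can be sketched with pandas value_counts, which avoids calling list.count once per color. The toy series and the threshold of 3 are assumptions for illustration; the notebook itself works on df.Color with a threshold of 50.

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses df.Color from train.csv.
colors = pd.Series(['Black'] * 5 + ['White'] * 3 + ['Calico'] * 1)
min_count = 3  # the notebook uses 50

counts = colors.value_counts()
frequent = counts[counts >= min_count].index.tolist()

# Colors below the threshold become None, as check_in_colors does.
filtered = colors.map(lambda c: c if c in frequent else None)
print(frequent)
print(filtered.isnull().sum())
```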



In [95]:

    
print('Number of colors ->',len(set(list_color)))
print('Number of colors with at least 50 samples -> ', len(list_color_over50))



Number of colors -> 366
Number of colors with at least 50 samples ->  60



In [96]:

    
def check_in_colors(x):
    if x in list_color_over50:
        return x



In [97]:

    
# Keep the color if it has at least 50 samples; otherwise fill with None
df.Color = df.Color.apply(check_in_colors)



In [98]:

    
# Hour extraction
df['hour'] = df.DateTime.apply(lambda x:x[11:13])



In [99]:

    
# Group hours 03, 05, 06, 07 as '5_8', 20-22 as '20_22', 23 and 00 as '23_0'; leave the rest unchanged
def check_hour(x):
    if x in ['03', '05', '06', '07']:
        return '5_8'
    elif x in ['20', '21', '22']:
        return '20_22'
    elif x in ['23', '00']:
        return '23_0'
    else:
        return x
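
A short sketch of how the hour slice and check_hour work together. The timestamps are made up and assume the 'YYYY-MM-DD HH:MM:SS' layout implied by the x[11:13] slice used in this notebook.

```python
def check_hour(x):
    # Same grouping as the cell above
    if x in ['03', '05', '06', '07']:
        return '5_8'
    elif x in ['20', '21', '22']:
        return '20_22'
    elif x in ['23', '00']:
        return '23_0'
    else:
        return x

# Hypothetical timestamps in the assumed DateTime format
timestamps = ['2014-02-12 18:22:00', '2015-07-01 06:05:00',
              '2015-03-09 00:10:00', '2014-11-30 21:45:00']
hours = [ts[11:13] for ts in timestamps]      # characters 11-12 hold the hour
buckets = [check_hour(h) for h in hours]
print(list(zip(hours, buckets)))
```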



In [100]:

    
df.hour = df.hour.apply(check_hour)



In [101]:

    
df.hour.unique()









    Out[101]:





array(['18', '12', '19', '13', '17', '5_8', '15', '14', '11', '16', '23_0',
       '09', '10', '08', '20_22'], dtype=object)

Prediction

Use Logistic Regression, RandomForest, SVM, and similar models

Model that also predicts Died



In [102]:

    
X = df.drop('OutcomeType', axis=1)



In [103]:

    
y = df.OutcomeType



In [104]:

    
X_dummy = pd.get_dummies(X[['SexuponOutcome', 'AgeuponOutcome', 'Breed', 'Color', 'hour']])



In [105]:

    
X_dummy['Name'] = X.Name
X_dummy['AnimalType'] = X.AnimalType
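
One detail worth seeing on a toy frame: with default settings, pd.get_dummies gives a row whose value is None all zeros in that feature's columns, so the breeds and colors that were set to None above act as an implicit "other" category. The frame below is a hypothetical stand-in, not data from train.csv.

```python
import pandas as pd

# Hypothetical toy frame: the middle row has no recognized breed.
toy = pd.DataFrame({'Breed': ['Beagle', None, 'Poodle'],
                    'hour': ['18', '5_8', '18']})

dummies = pd.get_dummies(toy[['Breed', 'hour']])
print(dummies.columns.tolist())
print(dummies.astype(int))
```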



In [106]:

    
from sklearn.model_selection import train_test_split



In [107]:

    
# train test split
X_train, X_test, y_train, y_test = train_test_split(X_dummy, y, test_size=0.20, random_state=42)



In [108]:

    
from sklearn.ensemble import RandomForestClassifier

RandomForest



In [109]:

    
# using RandomForest
model_rf = RandomForestClassifier(n_estimators=30)
result_rf = model_rf.fit(X_train, y_train)
result_rf.score(X_test, y_test)









    Out[109]:





0.63224841002618781



In [110]:

    
df_importance = pd.DataFrame(list(zip(X_dummy.columns, model_rf.feature_importances_)), columns=['colname', 'importance'])
df_importance.sort_values('importance', ascending=False)









    Out[110]:






  
    
      
      colname                         importance
148   Name                            0.057559
0     SexuponOutcome_Intact Female    0.036039
17    AgeuponOutcome_2 months         0.034948
2     SexuponOutcome_Neutered Male    0.034786
1     SexuponOutcome_Intact Male      0.031827
3     SexuponOutcome_Spayed Female    0.030567
77    Color_Black/White               0.021285
149   AnimalType                      0.020174
19    AgeuponOutcome_2 years          0.018456
142   hour_17                         0.017932
73    Color_Black                     0.016367
53    Breed_Domestic Shorthair        0.016320
143   hour_18                         0.015980
49    Breed_Chihuahua                 0.015542
10    AgeuponOutcome_1 year           0.015454
57    Breed_Labrador Retriever        0.014939
141   hour_16                         0.014232
137   hour_12                         0.014019
140   hour_15                         0.013881
139   hour_14                         0.013808
138   hour_13                         0.012561
63    Breed_Pit Bull                  0.012241
136   hour_11                         0.011129
4     SexuponOutcome_Unknown          0.010868
21    AgeuponOutcome_3 months         0.010414
23    AgeuponOutcome_3 years          0.010218
146   hour_23_0                       0.010139
90    Color_Brown/White               0.010057
116   Color_Tan/White                 0.009657
121   Color_White                     0.009642
...   ...                             ...
112   Color_Sable/White               0.001404
94    Color_Chocolate/Tan             0.001343
105   Color_Lynx Point                0.001334
113   Color_Seal Point                0.001252
89    Color_Brown/Tan                 0.001206
8     AgeuponOutcome_1 week           0.001152
127   Color_White/Gray                0.001140
98    Color_Cream Tabby/White         0.001090
101   Color_Flame Point               0.001083
131   Color_White/Tricolor            0.000988
145   hour_20_22                      0.000936
102   Color_Gold                      0.000919
16    AgeuponOutcome_2 days           0.000834
128   Color_White/Orange Tabby        0.000806
66    Breed_Ridgeback                 0.000717
118   Color_Torbie/White              0.000714
20    AgeuponOutcome_3 days           0.000686
47    Breed_Carolina                  0.000633
28    AgeuponOutcome_5 days           0.000545
5     AgeuponOutcome_0 years          0.000500
58    Breed_Maine Coon                0.000423
6     AgeuponOutcome_1 day            0.000404
32    AgeuponOutcome_6 days           0.000380
24    AgeuponOutcome_4 days           0.000330
54    Breed_Finnish                   0.000319
30    AgeuponOutcome_5 weeks          0.000306
60    Breed_Manx                      0.000299
62    Breed_Persian                   0.000287
42    Breed_American Shorthair        0.000153
71    Breed_Swedish Vallhund          0.000105
    
  

150 rows × 2 columns



In [111]:

    
from sklearn import metrics
print(metrics.classification_report(y_test, model_rf.predict(X_test)))









    



             precision    recall  f1-score   support

   Adoption       0.68      0.74      0.71      2219
       Died       0.00      0.00      0.00        33
 Euthanasia       0.35      0.20      0.26       298
Return_to_owner       0.46      0.42      0.44       961
   Transfer       0.68      0.69      0.69      1835

avg / total       0.62      0.63      0.62      5346



In [112]:

    
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, model_rf.predict(X_test))
print(cm)
# Row-wise percentages; casting to float avoids integer truncation of the ratios
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ RandomForest')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')









    



[[1643    1   21  275  279]
 [   3    0    3    1   26]
 [  43    1   60   62  132]
 [ 372    0   23  406  160]
 [ 354    0   63  147 1271]]
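
The printed matrix can be turned into per-class recall and overall accuracy with a few lines of NumPy. This sketch copies the RandomForest counts shown above by hand (rows are true labels, columns are predicted labels) and should reproduce the recall column of the classification report and the score of about 0.632.

```python
import numpy as np

labels = ['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']
# RandomForest confusion matrix as printed above (rows: true label, columns: predicted label).
cm = np.array([
    [1643, 1, 21, 275, 279],
    [3, 0, 3, 1, 26],
    [43, 1, 60, 62, 132],
    [372, 0, 23, 406, 160],
    [354, 0, 63, 147, 1271],
])

# Divide each row by its total to get the percentage of each true class per prediction.
row_pct = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
recall = np.diag(row_pct) / 100          # diagonal = correctly predicted share
accuracy = np.trace(cm) / cm.sum()       # correct predictions over all samples

for label, r in zip(labels, recall):
    print(f"{label:<16}recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

None of the 33 Died samples is predicted correctly, and most of them (26) are classified as Transfer.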






    Out[112]:





<matplotlib.text.Text at 0x109860790>



In [113]:

    
model_rf.predict_proba(X_test)









    Out[113]:





array([[ 0.4       ,  0.        ,  0.23333333,  0.26666667,  0.1       ],
       [ 0.66944444,  0.        ,  0.        ,  0.        ,  0.33055556],
       [ 0.        ,  0.        ,  0.03333333,  0.        ,  0.96666667],
       ..., 
       [ 0.99444444,  0.        ,  0.        ,  0.        ,  0.00555556],
       [ 0.        ,  0.01666667,  0.        ,  0.        ,  0.98333333],
       [ 0.35      ,  0.        ,  0.        ,  0.55      ,  0.1       ]])

Logistic Regression



In [114]:

    
from sklearn.linear_model import LogisticRegression



In [115]:

    
from sklearn import metrics



In [116]:

    
model_lr = LogisticRegression(C=1e5).fit(X_train, y_train)
print(metrics.classification_report(y_test, model_lr.predict(X_test)))









    



             precision    recall  f1-score   support

   Adoption       0.67      0.85      0.75      2219
       Died       1.00      0.03      0.06        33
 Euthanasia       0.52      0.11      0.18       298
Return_to_owner       0.49      0.41      0.45       961
   Transfer       0.73      0.66      0.69      1835

avg / total       0.65      0.66      0.64      5346



In [117]:

    
model_lr.score(X_test, y_test)









    Out[117]:





0.65918443696221474



In [118]:

    
cm = confusion_matrix(y_test, model_lr.predict(X_test))
print(cm)
# Row-wise percentages; casting to float avoids integer truncation of the ratios
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ Logistic Regression')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')









    



[[1890    0    3  193  133]
 [   2    1    4    0   26]
 [  42    0   33   72  151]
 [ 431    0    4  395  131]
 [ 469    0   20  141 1205]]






    Out[118]:





<matplotlib.text.Text at 0x10b4d8350>



In [119]:

    
model_lr.predict_proba(X_test)









    Out[119]:





array([[  2.19453771e-01,   1.57030382e-03,   2.34032869e-01,
          4.21208808e-01,   1.23734248e-01],
       [  8.52733288e-01,   2.08851912e-03,   1.67755384e-02,
          4.46858506e-03,   1.23934070e-01],
       [  3.02549631e-05,   1.21143190e-06,   1.22835006e-01,
          4.58867370e-03,   8.72544854e-01],
       ..., 
       [  8.26766837e-01,   6.51437676e-04,   3.75934604e-03,
          1.15783977e-02,   1.57243982e-01],
       [  2.84419127e-06,   1.85681636e-02,   1.30407412e-01,
          3.67643513e-03,   8.47345145e-01],
       [  5.27098331e-01,   1.79364805e-07,   1.44817698e-02,
          4.04053472e-01,   5.43662476e-02]])



In [121]:

    
from sklearn.linear_model import Ridge, Lasso, ElasticNet



In [122]:

    
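# Lasso is a regression estimator, so fitting it on string class labels raises the ValueError below
lasso1 = Lasso(alpha = 0).fit(X_train, y_train)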









    



/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator
  if __name__ == '__main__':






    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-122-306dd26fd5fc> in <module>()
----> 1 lasso1 = Lasso(alpha = 0).fit(X_train, y_train)

/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.pyc in fit(self, X, y, check_input)
    654         # when bypassing checks
    655         if check_input:
--> 656             y = np.asarray(y, dtype=np.float64)
    657             X, y = check_X_y(X, y, accept_sparse='csc', dtype=np.float64,
    658                              order='F',

/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    472 
    473     """
--> 474     return array(a, dtype, copy=False, order=order)
    475 
    476 def asanyarray(a, dtype=None, order=None):

ValueError: could not convert string to float: Adoption




Kernel SVM



In [65]:

    
from sklearn.svm import SVC



In [66]:

    
model_svc = SVC(probability=True).fit(X_train, y_train)



In [68]:

    
print(metrics.classification_report(y_test, model_svc.predict(X_test)))









    



             precision    recall  f1-score   support

   Adoption       0.60      0.95      0.73      2219
       Died       0.00      0.00      0.00        33
 Euthanasia       0.00      0.00      0.00       298
Return_to_owner       0.57      0.20      0.30       961
   Transfer       0.76      0.62      0.68      1835

avg / total       0.61      0.64      0.59      5346







    



/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [69]:

    
model_svc.score(X_test, y_test)









    Out[69]:





0.64216236438458663



In [71]:

    
cm = confusion_matrix(y_test, model_svc.predict(X_test))
print(cm)
# Row-wise percentages; casting to float avoids integer truncation of the ratios
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ SVM')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')









    



[[2100    0    0   41   78]
 [   5    0    0    0   28]
 [ 100    0    0   33  165]
 [ 672    0    0  196   93]
 [ 625    0    0   73 1137]]






    Out[71]:





<matplotlib.text.Text at 0x111ddb910>

Let's use the three models together as an ensemble.

1. Voting



In [85]:

    
from sklearn.ensemble import VotingClassifier



In [84]:

    
clf1 = LogisticRegression(C=1e5, random_state=213)
clf2 = RandomForestClassifier(n_estimators=30, random_state=123)
clf3 = SVC(probability=True)



In [86]:

    
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')
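
With voting='soft', VotingClassifier averages the class probabilities of its estimators and picks the class with the highest average. The sketch below shows that rule on the first test row, using values rounded from the RandomForest and Logistic Regression predict_proba outputs printed earlier. The SVC probabilities are not shown in this notebook, so only two of the three models appear here; soft_vote is a hypothetical helper, not a scikit-learn function.

```python
labels = ['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']

# First test-row probabilities, rounded from the predict_proba outputs printed earlier.
proba_rf = [0.4000, 0.0000, 0.2333, 0.2667, 0.1000]
proba_lr = [0.2195, 0.0016, 0.2340, 0.4212, 0.1237]

def soft_vote(probas, weights=None):
    """Weighted average of per-class probabilities, one list per model."""
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]

avg = soft_vote([proba_rf, proba_lr])
winner = labels[max(range(len(avg)), key=avg.__getitem__)]
print([round(a, 4) for a in avg], winner)
```

On this row the RandomForest alone would pick Adoption, but the averaged probabilities favor Return_to_owner.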



In [93]:

    
# This takes a long time, so run it a bit later
eclf.fit(X_train, y_train)









    Out[93]:





VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)



In [94]:

    
eclf.score(X_test, y_test)









    Out[94]:





0.65544332210998879



In [95]:

    
predict_eclf = eclf.predict(X_test)



In [96]:

    
print(metrics.classification_report(y_test, predict_eclf))









    



             precision    recall  f1-score   support

   Adoption       0.66      0.84      0.74      2219
       Died       0.00      0.00      0.00        33
 Euthanasia       0.73      0.05      0.10       298
Return_to_owner       0.49      0.40      0.44       961
   Transfer       0.72      0.67      0.70      1835

avg / total       0.65      0.66      0.63      5346



In [99]:

    
cm = confusion_matrix(y_test, eclf.predict(X_test))

print(cm)
# Row-wise percentages; casting to float avoids integer truncation of the ratios
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')









    



[[1870    0    1  201  147]
 [   4    0    2    0   27]
 [  52    0   16   69  161]
 [ 434    0    1  388  138]
 [ 462    0    2  141 1230]]






    Out[99]:





<matplotlib.text.Text at 0x11353e610>

Submit to Kaggle.



In [105]:

    
eclf_kaggle = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')



In [109]:

    
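# X_dummy may carry an OutcomeType column from the Died-excluded experiment; drop it if present
X_dummy = X_dummy.drop('OutcomeType', axis=1, errors='ignore')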



In [110]:

    
eclf_kaggle.fit(X_dummy, y)









    Out[110]:





VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)



In [112]:

    
df_test = pd.read_csv('test.csv')
df_test = df_test.drop(['ID'], axis=1)

# Name encoding: 1 if the animal has a name, 0 otherwise
df_test['Name'] = df_test['Name'].notnull().astype(int)

# AnimalType encoding: 1 for dog, 0 for cat
df_test.AnimalType = df_test.AnimalType.apply(lambda x: 1 if x=='Dog' else 0)

# AgeuponOutcome: merge animals aged 13 years or more into a single category
df_test.AgeuponOutcome = df_test.AgeuponOutcome.fillna('0 years')
df_test.AgeuponOutcome = df_test.AgeuponOutcome.apply(check_over13years)

# Breed
df_test.Breed = df_test.Breed.apply(check_in_breeds)

# Color
df_test.Color = df_test.Color.apply(check_in_colors)

# Hour
df_test['hour'] = df_test.DateTime.apply(lambda x:x[11:13])
df_test.hour = df_test.hour.apply(check_hour)
df_test = df_test.drop('DateTime', axis=1)



In [113]:

    
X_test_dummy = pd.get_dummies(df_test[['SexuponOutcome', 'AgeuponOutcome', 
                                       'Breed', 'Color', 'hour']])



In [114]:

    
X_test_dummy['Name'] = df_test.Name
X_test_dummy['AnimalType'] = df_test.AnimalType



In [115]:

    
print(X_test_dummy.columns[130:150])
print(X_dummy.columns[130:150])









    



Index([u'Color_White/Tan', u'Color_White/Tricolor', u'Color_Yellow',
       u'hour_08', u'hour_09', u'hour_10', u'hour_11', u'hour_12', u'hour_13',
       u'hour_14', u'hour_15', u'hour_16', u'hour_17', u'hour_18', u'hour_19',
       u'hour_20_22', u'hour_23_0', u'hour_5_8', u'Name', u'AnimalType'],
      dtype='object')
Index([u'Color_White/Tan', u'Color_White/Tricolor', u'Color_Yellow',
       u'hour_08', u'hour_09', u'hour_10', u'hour_11', u'hour_12', u'hour_13',
       u'hour_14', u'hour_15', u'hour_16', u'hour_17', u'hour_18', u'hour_19',
       u'hour_20_22', u'hour_23_0', u'hour_5_8', u'Name', u'AnimalType'],
      dtype='object')
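
Printing a slice of the columns only confirms that part of the two frames agrees. A more defensive option, sketched here on hypothetical toy frames, is to reindex the test dummies to the training columns so that order and membership always match what the model was fitted on. In the notebook the equivalent call would use X_dummy.columns; that is a suggestion, not something the original cells do.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_dummy (train) and X_test_dummy (test).
train_dummy = pd.DataFrame({'hour_08': [1, 0], 'hour_09': [0, 1], 'Color_Gold': [0, 1]})
test_dummy = pd.DataFrame({'hour_09': [1, 0], 'hour_08': [0, 1], 'Color_Agouti': [1, 0]})

# Reorder to the training columns, add missing ones as 0, drop test-only ones.
test_aligned = test_dummy.reindex(columns=train_dummy.columns, fill_value=0)

print(test_aligned.columns.tolist())
print(test_aligned)
```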



In [154]:

    
result_predict = eclf_kaggle.predict(X_test_dummy)



In [155]:

    
df_result = pd.DataFrame(columns=['ID','Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'])
df_result['ID'] = range(1, len(result_predict)+1)
count = 1
for predict_val in result_predict:
    
    df_result.loc[df_result.ID==count, predict_val] = 1
    count+=1
df_result = df_result.fillna(0)
df_result.index = df_result.ID
df_result = df_result.drop('ID', axis=1)
df_result.to_csv('submission2.csv')
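
The loop above writes a hard 0/1 row per animal. If the competition is scored on predicted probabilities (an assumption, not something stated in this notebook), a soft-voting ensemble can supply them through predict_proba. The sketch below builds the same column layout from a small hypothetical probability array; in the notebook the array would come from eclf_kaggle.predict_proba(X_test_dummy) and the column order from eclf_kaggle.classes_.

```python
import pandas as pd

# Hypothetical probabilities for three test rows; in the notebook they would come from
# eclf_kaggle.predict_proba(X_test_dummy), with column order given by eclf_kaggle.classes_.
classes = ['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']
proba = [
    [0.40, 0.00, 0.23, 0.27, 0.10],
    [0.67, 0.00, 0.00, 0.00, 0.33],
    [0.00, 0.00, 0.03, 0.00, 0.97],
]

submission = pd.DataFrame(proba, columns=classes)
# IDs start at 1, matching the range used in the loop above.
submission.index = pd.RangeIndex(1, len(submission) + 1, name='ID')
print(submission)
# submission.to_csv('submission_proba.csv')  # hypothetical file name
```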

Ensemble via joint probability



In [116]:

    
model_rf = RandomForestClassifier()
model_rf.fit(X_dummy, y)
model_lf = LogisticRegression()
model_lf.fit(X_dummy, y)
result_lf = model_lf.predict_proba(X_test_dummy)
result_rf = model_rf.predict_proba(X_test_dummy)
# Multiply the two models' class probabilities element-wise
rs = (result_rf*result_lf)
# Map each argmax column index back to its class label (same order as model_rf.classes_)
result_dict = {}
result_dict[0] = 'Adoption'
result_dict[1] = 'Died'
result_dict[2] = 'Euthanasia'
result_dict[3] = 'Return_to_owner'
result_dict[4] = 'Transfer'
rs2 = []
for i in rs.argmax(axis=1):
    rs2.append(result_dict[i])
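
A small sketch of the product rule on two test rows, using values rounded from the predict_proba outputs printed earlier in this notebook (those came from the models fitted on the training split, so the numbers only illustrate the rule). The renormalization step is an addition for readability and does not change the argmax; product_rule is a hypothetical helper name.

```python
labels = ['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']

# Two test rows, rounded from the predict_proba outputs printed earlier.
rows_rf = [[0.4000, 0.0000, 0.2333, 0.2667, 0.1000],
           [0.0000, 0.0167, 0.0000, 0.0000, 0.9833]]
rows_lr = [[0.2195, 0.0016, 0.2340, 0.4212, 0.1237],
           [0.0000, 0.0186, 0.1304, 0.0037, 0.8473]]

def product_rule(p, q):
    """Element-wise product of two probability vectors, renormalized to sum to 1."""
    raw = [a * b for a, b in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw]

predictions = []
for p, q in zip(rows_rf, rows_lr):
    joint = product_rule(p, q)
    best = max(range(len(joint)), key=joint.__getitem__)
    predictions.append(labels[best])
    print(labels[best], [round(j, 3) for j in joint])
```

Note the veto effect: in the second row Logistic Regression gives Euthanasia about 0.13, but the RandomForest probability of exactly 0 removes that class from the product. A forest with few trees produces many exact zeros, so this rule is stricter than averaging.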



In [118]:

    
df_result = pd.DataFrame(columns=['ID','Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'])
df_result['ID'] = range(1, len(rs)+1)
count = 1
for predict_val in rs2:
    
    df_result.loc[df_result.ID==count, predict_val] = 1
    count+=1
df_result = df_result.fillna(0)
df_result.index = df_result.ID
df_result = df_result.drop('ID', axis=1)
df_result.to_csv('submission2.csv')

Model excluding Died.



In [72]:

    
# df2 is a dataframe that excludes samples whose OutcomeType is Died.
# Copy first so that adding OutcomeType does not modify X_dummy.
df2 = X_dummy.copy()
df2['OutcomeType'] = y



In [74]:

    
df2 = df2[df2.OutcomeType != 'Died']



In [76]:

    
X_dummy2 = df2.drop('OutcomeType', axis=1)



In [78]:

    
y2 = df2.OutcomeType



In [80]:

    
# train test split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_dummy2, y2, test_size=0.20, random_state=42)

RandomForest



In [81]:

    
# using RandomForest
model_rf2 = RandomForestClassifier(n_estimators=30)
model_rf2.fit(X_train2, y_train2)
model_rf2.score(X_test2, y_test2)









    Out[81]:





0.63576408517052951



In [83]:

    
print(metrics.classification_report(y_test2, model_rf2.predict(X_test2)))









    



             precision    recall  f1-score   support

   Adoption       0.68      0.75      0.71      2149
 Euthanasia       0.38      0.20      0.26       316
Return_to_owner       0.44      0.40      0.42       946
   Transfer       0.69      0.70      0.69      1896

avg / total       0.62      0.64      0.63      5307



In [90]:

    
cm = confusion_matrix(y_test2, model_rf2.predict(X_test2))
print(cm)
# Row-wise percentages; casting to float avoids integer truncation of the ratios
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ randomforest')
plt.colorbar()
outcomes = sorted(y_test2.unique())
tick_marks = np.arange(len(set(list(y_test2))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')









    



[[1611   14  256  268]
 [  50   63   63  140]
 [ 359   25  379  183]
 [ 355   63  157 1321]]






    Out[90]:





<matplotlib.text.Text at 0x10cac7e10>

Logistic Regression



In [87]:

    
model_lr2 = LogisticRegression(C=1e5).fit(X_train2, y_train2)
print(metrics.classification_report(y_test2, model_lr2.predict(X_test2)))









    



             precision    recall  f1-score   support

   Adoption       0.66      0.84      0.74      2149
 Euthanasia       0.63      0.11      0.19       316
Return_to_owner       0.48      0.42      0.45       946
   Transfer       0.75      0.67      0.71      1896

avg / total       0.66      0.66      0.64      5307



In [89]:

    
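# Score on the Died-excluded split. The value recorded below was obtained with
# (X_test, y_test) from the first split, so it is not directly comparable.
model_lr2.score(X_test2, y_test2)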









    Out[89]:





0.66086793864571647



In [91]:

    
cm = confusion_matrix(y_test2, model_lr2.predict(X_test2))
print(cm)
# Row-wise percentages; casting to float avoids integer truncation of the ratios
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ logistic regression')
plt.colorbar()
outcomes = sorted(y_test2.unique())
tick_marks = np.arange(len(set(list(y_test2))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')









    



[[1803    0  211  135]
 [  46   36   64  170]
 [ 423    7  396  120]
 [ 460   14  147 1275]]






    Out[91]:





<matplotlib.text.Text at 0x111db9810>



In [98]:

    
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')



In [100]:

    
eclf2.fit(X_train2, y_train2)









    Out[100]:





VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)



In [101]:

    
eclf2.score(X_test2, y_test2)









    Out[101]:





0.66968155266628981



In [102]:

    
predict_eclf2 = eclf2.predict(X_test2)



In [103]:

    
print(metrics.classification_report(y_test2, predict_eclf2))









    



             precision    recall  f1-score   support

   Adoption       0.66      0.86      0.75      2149
 Euthanasia       0.86      0.06      0.11       316
Return_to_owner       0.50      0.42      0.46       946
   Transfer       0.76      0.68      0.72      1896

avg / total       0.68      0.67      0.65      5307



In [104]:

    
cm = confusion_matrix(y_test2, predict_eclf2)
print(cm)
# Row-wise percentages; casting to float avoids integer truncation of the ratios
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True) * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ ensemble')
plt.colorbar()
outcomes = sorted(y_test2.unique())
tick_marks = np.arange(len(set(list(y_test2))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')









    



[[1841    0  189  119]
 [  65   19   64  168]
 [ 424    2  397  123]
 [ 452    1  146 1297]]






    Out[104]:





<matplotlib.text.Text at 0x124a1c1d0>

Submit to Kaggle



In [156]:

    
eclf_kaggle2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')



In [157]:

    
eclf_kaggle2.fit(X_dummy2, y2)









    Out[157]:





VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)



In [158]:

    
result_predict2 = eclf_kaggle2.predict(X_test_dummy)



In [159]:

    
df_result = pd.DataFrame(columns=['ID','Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'])
df_result['ID'] = range(1, len(result_predict2)+1)
count = 1
for predict_val in result_predict2:
    
    df_result.loc[df_result.ID==count, predict_val] = 1
    count+=1
df_result = df_result.fillna(0)
df_result.index = df_result.ID
df_result = df_result.drop('ID', axis=1)
df_result.to_csv('submission3.csv')



